Introduction

Description of the dataset:

Forced Expiratory Volume (FEV) is an index of pulmonary function that measures the volume of air expelled after one second of constant effort. The data contain determinations of FEV for 654 children, ages 6 – 22, who were seen in a childhood respiratory disease study in 1980 in East Boston, Massachusetts. The data are part of a larger study following the change in pulmonary function over time in children.

Variables in the Dataset:
VariableName Category Description
ID Numeric Uniquely identifies each row
Age Numeric Age of the child
Height Numeric Height of the child in inches
Sex Categorical Sex of the child
Smoker Categorical Whether the child is a non-smoker or current smoker
FEV Numeric FEV of the child in litres
Background information of the dataset:


Note: No citation required for this source (http://www.statsci.org/data/general/fev.html)

Why is it interesting:

We chose this dataset out of personal interest. First, we want to see whether a child's FEV can be predicted from the available predictors, rather than performing a full Pulmonary Function Test to identify pulmonary disease in a child. Second, we want to analyze statistically which predictors increase or decrease a child's pulmonary function, and try to find answers along these lines.

Why are we creating a model of this Data:

Our aim is to predict the child's Forced Expiratory Volume (FEV) based on the child's Age, Height, Sex, and whether the child is a smoker or not. This could help kick-start treatment for the child based on the predicted FEV reading, rather than going through a full Pulmonary Function Test.

Goal of the Model:

The goal is to find, after trying several methods, the model that predicts the child's FEV with minimal error. The best model will be the one with the lowest Root Mean Square Error (RMSE) under Leave-One-Out Cross-Validation (LOOCV).

Methods

We will carry out several data analyses and modeling methods in this section; each subsection describes its part of the task.

Load FEV data, Observations:

childfev = read.csv("http://www.statsci.org/data/general/fev.txt",
                    sep = "\t",
                    quote = "\"",
                    comment.char = "",
                    stringsAsFactors = FALSE)
childfev$Sex = as.factor(childfev$Sex)
childfev$Smoker = as.factor(childfev$Smoker)
str(childfev)
## 'data.frame':    654 obs. of  6 variables:
##  $ ID    : int  301 451 501 642 901 1701 1752 1753 1901 1951 ...
##  $ Age   : int  9 8 7 9 9 8 6 6 8 9 ...
##  $ FEV   : num  1.71 1.72 1.72 1.56 1.9 ...
##  $ Height: num  57 67.5 54.5 53 57 61 58 56 58.5 60 ...
##  $ Sex   : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 1 1 1 1 1 ...
##  $ Smoker: Factor w/ 2 levels "Current","Non": 2 2 2 2 2 2 2 2 2 2 ...
dim(childfev)
## [1] 654   6
head(childfev$FEV,10)
##  [1] 1.708 1.724 1.720 1.558 1.895 2.336 1.919 1.415 1.987 1.942

From the dataset we observe that Age and Height are numerical variables, whereas Sex and Smoker are categorical variables. FEV is the numerical response variable.

**Pairs plot of the FEV data:**
pairs(childfev[c('Age','FEV','Height','Sex','Smoker')], col = "darkgrey")

From the pairs plot, there are clear signs of a linear relationship between FEV and Height. There is no equally clear sign of a linear relationship between FEV and Age. The two categorical variables, Sex and Smoker, each show two distinct groups. Let's explore this.

Let's check the correlation matrix to explore the correlations among the variables.

cor(childfev[c('Age','Height','FEV')])
##           Age Height    FEV
## Age    1.0000 0.7919 0.7565
## Height 0.7919 1.0000 0.8681
## FEV    0.7565 0.8681 1.0000

The correlation matrix supports what the pairs plot shows: Age and Height are highly correlated with the FEV response, as well as with each other. We will need to examine the variance inflation factor (VIF) when building our models.
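To quantify this collinearity, the variance inflation factors can be computed; a sketch assuming the `car` package is installed (`vif()` is not in base R):

```r
# Variance inflation factors for the additive model.
# Assumes the 'car' package is installed and childfev is loaded as above.
library(car)
vif(lm(FEV ~ Age + Height + Sex + Smoker, data = childfev))
```

A VIF well above 5 (or 10, depending on the rule of thumb) for Age or Height would flag the collinearity suggested by their 0.79 correlation.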

Categorical Data of Sex and Smoker
table(childfev$Sex)
## 
## Female   Male 
##    318    336
table(childfev$Smoker)
## 
## Current     Non 
##      65     589

MLR FEV data:

Let's start with a Multiple Linear Regression (MLR) on all the predictors (except ID), with FEV as the response.

mlr_model = lm(FEV ~ Age + Height + Sex + Smoker, data = childfev)
summary(mlr_model)
## 
## Call:
## lm(formula = FEV ~ Age + Height + Sex + Smoker, data = childfev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3766 -0.2503  0.0089  0.2559  1.9205 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.54422    0.23205  -19.58  < 2e-16 ***
## Age          0.06551    0.00949    6.90  1.2e-11 ***
## Height       0.10420    0.00476   21.90  < 2e-16 ***
## SexMale      0.15710    0.03321    4.73  2.7e-06 ***
## SmokerNon    0.08725    0.05925    1.47     0.14    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.412 on 649 degrees of freedom
## Multiple R-squared:  0.775,  Adjusted R-squared:  0.774 
## F-statistic:  560 on 4 and 649 DF,  p-value: <2e-16
Assumption of the Model

Let's verify the assumptions of the model: constant variance and normality.

Let's plot the residuals versus the fitted values to see the distribution.

par(mfrow = c(1,2))
plot(resid(mlr_model)~fitted(mlr_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model")
abline(h = 0, lwd = 2, col = "darkorange")

qqnorm(resid(mlr_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(mlr_model), lwd = 2, col = "darkorange")

From the above residuals versus fitted plot, it seems the constant variance assumption is suspect, as the residuals are not equally distributed.

The Q-Q plot tells a similar story: both tails appear somewhat fat, which suggests the normality assumption is also suspect for this model.

Let’s do the BP Test and Shapiro Test to gather some more evidence

library(lmtest)  # bptest() comes from the lmtest package
bptest(mlr_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  mlr_model
## BP = 65, df = 4, p-value = 3e-13

The BP test p-value is very small, which confirms that the constant variance assumption is suspect.

shapiro.test(resid(mlr_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(mlr_model)
## W = 0.99, p-value = 0.0002

The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect as well.

RSS, RMSE and RMSE LOOCV

Let's calculate the RSS, RMSE and RMSE LOOCV of the model.

RSS

sum((resid(mlr_model))^2)
## [1] 110.3

RMSE

sqrt(mean(resid(mlr_model)^2))
## [1] 0.4106

RMSE LOOCV

sqrt(mean((resid(mlr_model)/(1-hatvalues(mlr_model)))^2))
## [1] 0.4145
Metric Value
RSS 110.2796
RMSE 0.4106
RMSE LOOCV 0.4145
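Since these three metrics are recomputed for every model below, they can be bundled into a small helper function (a sketch; `model_metrics` is our own name, not part of the report's code):

```r
# RSS, RMSE, and the closed-form LOOCV RMSE for a fitted lm object.
# The LOOCV shortcut divides each residual by (1 - leverage), so no refitting is needed.
model_metrics = function(model) {
  r = resid(model)
  c(RSS        = sum(r^2),
    RMSE       = sqrt(mean(r^2)),
    RMSE_LOOCV = sqrt(mean((r / (1 - hatvalues(model)))^2)))
}
```

For example, `model_metrics(mlr_model)` returns all three values in one call.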

Transformation of FEV (Logarithmic):

Let's explore whether a transformation of the response variable FEV improves the score (a lower RMSE LOOCV).

Let's first plot the histogram of the FEV response variable with and without the logarithmic transformation.

par(mfrow=c(1,2))
hist(childfev$FEV,
     xlab = 'FEV response Variable Data',
     probability = TRUE,
     breaks = 20,
     col = "darkgrey",
     main = "Original FEV response")
hist(log(childfev$FEV),
     xlab = 'log(FEV) response Variable Data',
     breaks = 20,
     col = "darkgrey",
     main = "Logarithmic FEV response")

Even after the logarithmic transformation, the FEV response does not appear normally distributed. Hence we will not explore the logarithmic transformation for the FEV response variable further.
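Rather than judging the histograms by eye, a Box-Cox analysis can suggest whether any power transformation of the response is worthwhile; a sketch using `MASS::boxcox` (MASS ships with R), assuming `mlr_model` as fitted above:

```r
# Profile log-likelihood over candidate power transformations of FEV.
# lambda = 1 means no transformation; lambda = 0 corresponds to log(FEV).
library(MASS)
boxcox(mlr_model, lambda = seq(-0.5, 1.5, 0.05))
```

If the plotted confidence interval for lambda sits near 1 and excludes 0, that supports the decision above not to use the log transformation.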

Transformation of FEV (Polynomial):

Let's explore polynomial transformations of the predictors to see whether we can find the best model, one that lowers the RMSE LOOCV.

Quadratic Transformation of the Numerical Predictors:
quad_model = lm(FEV ~ Sex + Smoker + poly(Age,2) + poly(Height,2), data = childfev)
summary(quad_model)
## 
## Call:
## lm(formula = FEV ~ Sex + Smoker + poly(Age, 2) + poly(Height, 
##     2), data = childfev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6065 -0.2280  0.0049  0.2250  1.7957 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.4622     0.0550   44.79  < 2e-16 ***
## SexMale            0.0952     0.0329    2.90   0.0039 ** 
## SmokerNon          0.1396     0.0575    2.43   0.0154 *  
## poly(Age, 2)1      4.9250     0.7568    6.51  1.5e-10 ***
## poly(Age, 2)2      0.5473     0.5435    1.01   0.3142    
## poly(Height, 2)1  15.5887     0.7807   19.97  < 2e-16 ***
## poly(Height, 2)2   2.8550     0.4969    5.75  1.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.395 on 647 degrees of freedom
## Multiple R-squared:  0.794,  Adjusted R-squared:  0.792 
## F-statistic:  416 on 6 and 647 DF,  p-value: <2e-16

The quadratic model is significant overall, with p-value < 2e-16. All terms are individually significant except the quadratic Age term (p ≈ 0.31).

Let's check the assumptions of the model.

Assumption of the Model

Let's verify the assumptions of the model: constant variance and normality.

Let's plot the residuals versus the fitted values to see the distribution.

par(mfrow = c(1,2))
plot(resid(quad_model)~fitted(quad_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Quadratic")
abline(h = 0, lwd = 2, col = "darkorange")

qqnorm(resid(quad_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(quad_model), lwd = 2, col = "darkorange")

From the residuals versus fitted plot, although this model looks better than the Multiple Linear Regression, it is still not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption remains suspect.

The Q-Q plot tells a similar story: both tails appear somewhat fat, which suggests the normality assumption is also suspect for this model.

Let’s do the BP Test and Shapiro Test to gather some more evidence

bptest(quad_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  quad_model
## BP = 81, df = 6, p-value = 3e-15

The BP test p-value is very small, which confirms that the constant variance assumption is suspect.

shapiro.test(resid(quad_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(quad_model)
## W = 0.99, p-value = 1e-05

The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect as well.

RSS, RMSE and RMSE LOOCV

Let's calculate the RSS, RMSE and RMSE LOOCV of the model.

RSS

sum((resid(quad_model))^2)
## [1] 101

RMSE

sqrt(mean(resid(quad_model)^2))
## [1] 0.393

RMSE LOOCV

sqrt(mean((resid(quad_model)/(1-hatvalues(quad_model)))^2))
## [1] 0.398
Metric Value
RSS 100.989
RMSE 0.393
RMSE LOOCV 0.398
The RMSE LOOCV is better than that of the Multiple Linear Regression. We will compare all the models at the end when making the final decision.
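Because the additive model is nested in the quadratic one (`poly(x, 2)` contains the linear term), the improvement can also be judged with an F-test; a sketch assuming both models as fitted above:

```r
# F-test for the quadratic terms added on top of the additive MLR.
anova(mlr_model, quad_model)
```

A small p-value here would confirm that the quadratic terms improve the fit beyond what the linear terms provide.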
Cubic Transformation of the Numerical Predictors:
cube_model = lm(FEV ~ Sex + Smoker + poly(Age,3) + poly(Height,3), data = childfev)
summary(cube_model)
## 
## Call:
## lm(formula = FEV ~ Sex + Smoker + poly(Age, 3) + poly(Height, 
##     3), data = childfev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6342 -0.2264  0.0074  0.2303  1.7741 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.4557     0.0556   44.19  < 2e-16 ***
## SexMale            0.0978     0.0333    2.94   0.0034 ** 
## SmokerNon          0.1453     0.0578    2.51   0.0122 *  
## poly(Age, 3)1      4.8895     0.7570    6.46  2.1e-10 ***
## poly(Age, 3)2      0.7797     0.5626    1.39   0.1662    
## poly(Age, 3)3     -0.7317     0.4667   -1.57   0.1174    
## poly(Height, 3)1  15.6558     0.7817   20.03  < 2e-16 ***
## poly(Height, 3)2   2.4713     0.5545    4.46  9.8e-06 ***
## poly(Height, 3)3   0.3618     0.4253    0.85   0.3953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.395 on 645 degrees of freedom
## Multiple R-squared:  0.795,  Adjusted R-squared:  0.793 
## F-statistic:  313 on 8 and 645 DF,  p-value: <2e-16

The cubic model is significant overall (p-value < 2e-16), but the quadratic and cubic Age terms and the cubic Height term are not individually significant.

Let's check the assumptions of the model.

Assumption of the Model

Let's verify the assumptions of the model: constant variance and normality.

Let's plot the residuals versus the fitted values to see the distribution.

par(mfrow = c(1,2))
plot(resid(cube_model)~fitted(cube_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Cubic")
abline(h = 0, lwd = 2, col = "darkorange")

qqnorm(resid(cube_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(cube_model), lwd = 2, col = "darkorange")

From the residuals versus fitted plot, although this model looks better than the Multiple Linear Regression, it is still not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption remains suspect.

The Q-Q plot tells a similar story: both tails appear somewhat fat, which suggests the normality assumption is also suspect for this model.

Let’s do the BP Test and Shapiro Test to gather some more evidence

bptest(cube_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  cube_model
## BP = 84, df = 8, p-value = 6e-15

The BP test p-value is very small, which confirms that the constant variance assumption is suspect.

shapiro.test(resid(cube_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(cube_model)
## W = 0.99, p-value = 1e-05

The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect as well.

RSS, RMSE and RMSE LOOCV

Let's calculate the RSS, RMSE and RMSE LOOCV of the model.

RSS

sum((resid(cube_model))^2)
## [1] 100.6

RMSE

sqrt(mean(resid(cube_model)^2))
## [1] 0.3922

RMSE LOOCV

sqrt(mean((resid(cube_model)/(1-hatvalues(cube_model)))^2))
## [1] 0.3984
Metric Value
RSS 100.5859
RMSE 0.3922
RMSE LOOCV 0.3984
The RMSE LOOCV is better than that of the Multiple Linear Regression. We will compare all the models at the end when making the final decision.
Higher Order Transformation of the Numerical Predictors:
high_model = lm(FEV ~ Sex + Smoker + poly(Age,6) + poly(Height,6), data = childfev)
summary(high_model)
## 
## Call:
## lm(formula = FEV ~ Sex + Smoker + poly(Age, 6) + poly(Height, 
##     6), data = childfev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7062 -0.2302  0.0136  0.2248  1.7638 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.4557     0.0562   43.69  < 2e-16 ***
## SexMale            0.0974     0.0340    2.87   0.0043 ** 
## SmokerNon          0.1455     0.0586    2.48   0.0133 *  
## poly(Age, 6)1      4.8282     0.7663    6.30  5.5e-10 ***
## poly(Age, 6)2      0.8241     0.5698    1.45   0.1486    
## poly(Age, 6)3     -0.7679     0.4742   -1.62   0.1058    
## poly(Age, 6)4      0.0457     0.4424    0.10   0.9177    
## poly(Age, 6)5      0.5243     0.4282    1.22   0.2212    
## poly(Age, 6)6     -0.1088     0.4211   -0.26   0.7962    
## poly(Height, 6)1  15.7181     0.7934   19.81  < 2e-16 ***
## poly(Height, 6)2   2.3834     0.5640    4.23  2.7e-05 ***
## poly(Height, 6)3   0.3608     0.4649    0.78   0.4379    
## poly(Height, 6)4  -0.1415     0.4348   -0.33   0.7450    
## poly(Height, 6)5  -0.2765     0.4186   -0.66   0.5092    
## poly(Height, 6)6  -0.8348     0.4050   -2.06   0.0397 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.395 on 639 degrees of freedom
## Multiple R-squared:  0.797,  Adjusted R-squared:  0.793 
## F-statistic:  180 on 14 and 639 DF,  p-value: <2e-16

The degree-6 model is significant overall (p-value < 2e-16), but beyond the linear Age term and the linear and quadratic Height terms, almost none of the higher-order terms are individually significant.

Let's check the assumptions of the model.

Assumption of the Model

Let's verify the assumptions of the model: constant variance and normality.

Let's plot the residuals versus the fitted values to see the distribution.

par(mfrow = c(1,2))
plot(resid(high_model)~fitted(high_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Higher-Order Polynomial")
abline(h = 0, lwd = 2, col = "darkorange")

qqnorm(resid(high_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(high_model), lwd = 2, col = "darkorange")

From the residuals versus fitted plot, although this model looks better than the Multiple Linear Regression, it is still not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption remains suspect.

The Q-Q plot tells a similar story: both tails appear somewhat fat, which suggests the normality assumption is also suspect for this model.

Let’s do the BP Test and Shapiro Test to gather some more evidence

bptest(high_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  high_model
## BP = 92, df = 14, p-value = 2e-13

The BP test p-value is very small, which confirms that the constant variance assumption is suspect.

shapiro.test(resid(high_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(high_model)
## W = 0.99, p-value = 8e-06

The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect as well.

RSS, RMSE and RMSE LOOCV

Let's calculate the RSS, RMSE and RMSE LOOCV of the model.

RSS

sum((resid(high_model))^2)
## [1] 99.47

RMSE

sqrt(mean(resid(high_model)^2))
## [1] 0.39

RMSE LOOCV

sqrt(mean((resid(high_model)/(1-hatvalues(high_model)))^2))
## [1] 0.4012
Metric Value
RSS 99.4667
RMSE 0.39
RMSE LOOCV 0.4012

It seems the model does not improve beyond degree 2 and in fact deteriorates after degree 2. We will explore polynomials with interactions to see if there is any improvement. We will also create a bigger model and reduce it with the help of AIC and BIC to explore whether we can improve the model.

Interaction between Predictors:

Let's explore interactions between the predictors to see whether we can find the best model, one that lowers the RMSE LOOCV.

2 way Interaction:
two_int_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 2, data = childfev)
summary(two_int_model)
## 
## Call:
## lm(formula = FEV ~ (Age + Height + Sex + Smoker)^2, data = childfev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4557 -0.2250  0.0074  0.2331  1.6724 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.20138    1.42176   -0.14    0.887    
## Age               -0.38722    0.06897   -5.61  2.9e-08 ***
## Height             0.04381    0.02213    1.98    0.048 *  
## SexMale           -0.84442    0.49098   -1.72    0.086 .  
## SmokerNon         -0.09124    1.24797   -0.07    0.942    
## Age:Height         0.00640    0.00101    6.32  5.0e-10 ***
## Age:SexMale        0.00431    0.01851    0.23    0.816    
## Age:SmokerNon      0.06144    0.02379    2.58    0.010 *  
## Height:SexMale     0.01555    0.00951    1.63    0.103    
## Height:SmokerNon  -0.00878    0.01961   -0.45    0.654    
## SexMale:SmokerNon -0.03269    0.13175   -0.25    0.804    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.391 on 643 degrees of freedom
## Multiple R-squared:   0.8,   Adjusted R-squared:  0.797 
## F-statistic:  257 on 10 and 643 DF,  p-value: <2e-16

The two-way interaction model is significant overall (p-value < 2e-16). The Age:Height and Age:Smoker interactions are significant, while the remaining interaction terms are not.

Let's check the assumptions of the model.

Assumption of the Model

Let's verify the assumptions of the model: constant variance and normality.

Let's plot the residuals versus the fitted values to see the distribution.

par(mfrow = c(1,2))
plot(resid(two_int_model)~fitted(two_int_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - 2-way Interaction")
abline(h = 0, lwd = 2, col = "darkorange")

qqnorm(resid(two_int_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(two_int_model), lwd = 2, col = "darkorange")

From the residuals versus fitted plot, although this model looks better than the Multiple Linear Regression, it is still not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption remains suspect.

The Q-Q plot tells a similar story: both tails appear somewhat fat, which suggests the normality assumption is also suspect for this model.

Let’s do the BP Test and Shapiro Test to gather some more evidence

bptest(two_int_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  two_int_model
## BP = 88, df = 10, p-value = 1e-14

The BP test p-value is very small, which confirms that the constant variance assumption is suspect.

shapiro.test(resid(two_int_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(two_int_model)
## W = 0.99, p-value = 0.0003

The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect as well.

RSS, RMSE and RMSE LOOCV

Let's calculate the RSS, RMSE and RMSE LOOCV of the model.

RSS

sum((resid(two_int_model))^2)
## [1] 98.14

RMSE

sqrt(mean(resid(two_int_model)^2))
## [1] 0.3874

RMSE LOOCV

sqrt(mean((resid(two_int_model)/(1-hatvalues(two_int_model)))^2))
## [1] 0.3964
Metric Value
RSS 98.1409
RMSE 0.3874
RMSE LOOCV 0.3964

The RMSE LOOCV is better than that of the Multiple Linear Regression. We will compare all the models at the end when making the final decision.

3 way Interaction:
three_int_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3, data = childfev)
summary(three_int_model)
## 
## Call:
## lm(formula = FEV ~ (Age + Height + Sex + Smoker)^3, data = childfev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4180 -0.2285  0.0059  0.2288  1.6306 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6.94328    7.52441    0.92  0.35648    
## Age                      -0.52738    0.55549   -0.95  0.34278    
## Height                   -0.06022    0.11584   -0.52  0.60331    
## SexMale                  -5.27953    3.09717   -1.70  0.08875 .  
## SmokerNon                -9.43485    7.46602   -1.26  0.20680    
## Age:Height                0.00807    0.00856    0.94  0.34640    
## Age:SexMale              -0.45320    0.14850   -3.05  0.00237 ** 
## Age:SmokerNon             0.48508    0.55069    0.88  0.37873    
## Height:SexMale            0.07826    0.04824    1.62  0.10524    
## Height:SmokerNon          0.13170    0.11503    1.14  0.25266    
## SexMale:SmokerNon         8.06596    2.70524    2.98  0.00298 ** 
## Age:Height:SexMale        0.00719    0.00213    3.38  0.00077 ***
## Age:Height:SmokerNon     -0.00630    0.00849   -0.74  0.45858    
## Age:SexMale:SmokerNon     0.00974    0.05426    0.18  0.85755    
## Height:SexMale:SmokerNon -0.12258    0.04275   -2.87  0.00427 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.383 on 639 degrees of freedom
## Multiple R-squared:  0.809,  Adjusted R-squared:  0.805 
## F-statistic:  193 on 14 and 639 DF,  p-value: <2e-16

The three-way interaction model is significant overall (p-value < 2e-16). The notable significant terms are Age:Sex, Sex:Smoker, Age:Height:Sex, and Height:Sex:Smoker.

Let's check the assumptions of the model.

Assumption of the Model

Let's verify the assumptions of the model: constant variance and normality.

Let's plot the residuals versus the fitted values to see the distribution.

par(mfrow = c(1,2))
plot(resid(three_int_model)~fitted(three_int_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - 3-way Interaction")
abline(h = 0, lwd = 2, col = "darkorange")

qqnorm(resid(three_int_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(three_int_model), lwd = 2, col = "darkorange")

From the residuals versus fitted plot, although this model looks better than the Multiple Linear Regression, it is still not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption remains suspect.

The Q-Q plot tells a similar story: both tails appear somewhat fat, which suggests the normality assumption is also suspect for this model.

Let’s do the BP Test and Shapiro Test to gather some more evidence

bptest(three_int_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  three_int_model
## BP = 97, df = 14, p-value = 2e-14

The BP test p-value is very small, which confirms that the constant variance assumption is suspect.

shapiro.test(resid(three_int_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(three_int_model)
## W = 0.99, p-value = 0.002

The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect as well.

RSS, RMSE and RMSE LOOCV

Let's calculate the RSS, RMSE and RMSE LOOCV of the model.

RSS

sum((resid(three_int_model))^2)
## [1] 93.89

RMSE

sqrt(mean(resid(three_int_model)^2))
## [1] 0.3789

RMSE LOOCV

sqrt(mean((resid(three_int_model)/(1-hatvalues(three_int_model)))^2))
## [1] 0.3907
Metric Value
RSS 93.8929
RMSE 0.3789
RMSE LOOCV 0.3907

The RMSE LOOCV is better than that of the Multiple Linear Regression. We will compare all the models at the end when making the final decision.

4 way Interaction:
four_int_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 4, data = childfev)
summary(four_int_model)
## 
## Call:
## lm(formula = FEV ~ (Age + Height + Sex + Smoker)^4, data = childfev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4258 -0.2254  0.0028  0.2256  1.6281 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                   25.7105    13.3060    1.93    0.054 .
## Age                           -1.9541     1.0023   -1.95    0.052 .
## Height                        -0.3497     0.2051   -1.70    0.089 .
## SexMale                      -31.1727    15.4641   -2.02    0.044 *
## SmokerNon                    -28.3251    13.3328   -2.12    0.034 *
## Age:Height                     0.0301     0.0155    1.95    0.052 .
## Age:SexMale                    1.6018     1.2116    1.32    0.187  
## Age:SmokerNon                  1.9294     1.0083    1.91    0.056 .
## Height:SexMale                 0.4734     0.2362    2.00    0.045 *
## Height:SmokerNon               0.4232     0.2057    2.06    0.040 *
## SexMale:SmokerNon             34.1547    15.5034    2.20    0.028 *
## Age:Height:SexMale            -0.0241     0.0184   -1.31    0.192  
## Age:Height:SmokerNon          -0.0286     0.0156   -1.84    0.067 .
## Age:SexMale:SmokerNon         -2.0718     1.2192   -1.70    0.090 .
## Height:SexMale:SmokerNon      -0.5209     0.2370   -2.20    0.028 *
## Age:Height:SexMale:SmokerNon   0.0317     0.0186    1.71    0.088 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.383 on 638 degrees of freedom
## Multiple R-squared:  0.81,   Adjusted R-squared:  0.805 
## F-statistic:  181 on 15 and 638 DF,  p-value: <2e-16

The four-way interaction model is significant overall (p-value < 2e-16), though most individual coefficients are only marginally significant at the 5% level.

Let's check the assumptions of the model.

Assumption of the Model

Let's verify the assumptions of the model: constant variance and normality.

Let's plot the residuals versus the fitted values to see the distribution.

par(mfrow = c(1,2))
plot(resid(four_int_model)~fitted(four_int_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - 4-way Interaction")
abline(h = 0, lwd = 2, col = "darkorange")

qqnorm(resid(four_int_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(four_int_model), lwd = 2, col = "darkorange")

From the residuals versus fitted plot, although this model looks better than the Multiple Linear Regression, it is still not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption remains suspect.

The Q-Q plot tells a similar story: both tails appear somewhat fat, which suggests the normality assumption is also suspect for this model.

Let’s do the BP Test and Shapiro Test to gather some more evidence

bptest(four_int_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  four_int_model
## BP = 98, df = 15, p-value = 3e-14

The BP test p-value is very small, which confirms that the constant variance assumption is suspect.

shapiro.test(resid(four_int_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(four_int_model)
## W = 0.99, p-value = 0.002

The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect as well.

RSS, RMSE and RMSE LOOCV

Let's calculate the RSS, RMSE and RMSE LOOCV of the model.

RSS

sum((resid(four_int_model))^2)
## [1] 93.47

RMSE

sqrt(mean(resid(four_int_model)^2))
## [1] 0.378

RMSE LOOCV

sqrt(mean((resid(four_int_model)/(1-hatvalues(four_int_model)))^2))
## [1] 0.3918
Metric Value
RSS 93.4651
RMSE 0.378
RMSE LOOCV 0.3918

The RMSE LOOCV is better than that of the Multiple Linear Regression. We will compare all the models at the end when making the final decision.
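Before building the big model, the LOOCV RMSE values so far can be collected in one place for the final comparison; a sketch assuming the model objects fitted in the sections above:

```r
# Closed-form LOOCV RMSE, applied to every candidate fitted so far.
loocv_rmse = function(model) sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
sapply(list(mlr      = mlr_model,      quad    = quad_model,
            cubic    = cube_model,     degree6 = high_model,
            twoway   = two_int_model,  threeway = three_int_model,
            fourway  = four_int_model),
       loocv_rmse)
```

At this point the three-way interaction model has the lowest value (0.3907).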

Model- Big Model, Stepwise via AIC / BIC:

Let's create a big model with polynomials of degree 3 and three-way interactions (categorical-to-categorical, categorical-to-numeric, and numeric-to-numeric) and check its score. We will also reduce the model via AIC and BIC to see whether the score improves.

Big Model:
big_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3 + poly(Age,3) + poly(Height,3), data = childfev)
summary(big_model)
## 
## Call:
## lm(formula = FEV ~ (Age + Height + Sex + Smoker)^3 + poly(Age, 
##     3) + poly(Height, 3), data = childfev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2966 -0.2289  0.0028  0.2179  1.6014 
## 
## Coefficients: (2 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.09658    7.56454    1.07  0.28488    
## Age                       -0.61839    0.56434   -1.10  0.27359    
## Height                    -0.08093    0.11699   -0.69  0.48929    
## SexMale                   -3.88732    3.17077   -1.23  0.22066    
## SmokerNon                -10.79496    7.65232   -1.41  0.15883    
## poly(Age, 3)1                   NA         NA      NA       NA    
## poly(Age, 3)2             -0.46787    1.14254   -0.41  0.68231    
## poly(Age, 3)3             -0.18683    0.54924   -0.34  0.73385    
## poly(Height, 3)1                NA         NA      NA       NA    
## poly(Height, 3)2          -0.13970    1.02831   -0.14  0.89198    
## poly(Height, 3)3          -0.95598    0.46743   -2.05  0.04125 *  
## Age:Height                 0.00967    0.00876    1.10  0.26987    
## Age:SexMale               -0.62552    0.16678   -3.75  0.00019 ***
## Age:SmokerNon              0.63679    0.57480    1.11  0.26835    
## Height:SexMale             0.06013    0.04922    1.22  0.22229    
## Height:SmokerNon           0.15546    0.11798    1.32  0.18807    
## SexMale:SmokerNon          7.71549    2.71796    2.84  0.00467 ** 
## Age:Height:SexMale         0.00958    0.00238    4.03 0.000063 ***
## Age:Height:SmokerNon      -0.00886    0.00887   -1.00  0.31812    
## Age:SexMale:SmokerNon      0.03389    0.05575    0.61  0.54353    
## Height:SexMale:SmokerNon  -0.12128    0.04282   -2.83  0.00477 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.383 on 635 degrees of freedom
## Multiple R-squared:  0.81,   Adjusted R-squared:  0.805 
## F-statistic:  151 on 18 and 635 DF,  p-value: <2e-16

The big model is significant overall, with an F-test p-value below 2e-16 and an adjusted R-squared of 0.805. Individually, several interaction terms are significant, but the Age and Height main effects are not (p-values of about 0.27 and 0.49). The linear terms of poly(Age, 3) and poly(Height, 3) are reported as NA because they are perfectly collinear with the Age and Height main effects already in the model, which is why the summary notes two coefficients not defined because of singularities.

Let’s check the Assumption of the model

Assumptions of the Model

Let's verify the assumptions of the model:
- Constant Variance
- Normality

Let's plot the Residuals-versus-Fitted plot and a Q-Q plot to see the distribution of the residuals.

par(mfrow = c(1,2))
plot(resid(big_model)~fitted(big_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted - Big Model")
abline(h = 0, lwd = 2, col = "darkorange")

qqnorm(resid(big_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(big_model), lwd = 2, col = "darkorange")

From the above residuals-versus-fitted plot, though it looks better than that of the Multiple Linear Regression, the model is not doing a good job for the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant-variance assumption is suspect.

The Q-Q plot tells a similar story: both tails appear somewhat heavy, which suggests the normality assumption is also suspect for this model.

Let’s do the BP Test and Shapiro Test to gather some more evidence

bptest(big_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  big_model
## BP = 100, df = 18, p-value = 4e-14

The BP test p-value is very low, which confirms that the constant-variance assumption is suspect

shapiro.test(resid(big_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(big_model)
## W = 0.99, p-value = 0.003

The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is likewise suspect

RSS, RMSE and RMSE LOOCV

Let's calculate the RSS, RMSE and RMSE LOOCV of the model.

RSS

sum((resid(big_model))^2)
## [1] 93.05

RMSE

sqrt(mean(resid(big_model)^2))
## [1] 0.3772

RMSE LOOCV

sqrt(mean((resid(big_model)/(1-hatvalues(big_model)))^2))
## [1] 0.3925
Metric Value
RSS 93.0501
RMSE 0.3772
RMSE LOOCV 0.3925

The RMSE LOOCV seems better than that of the Multiple Linear Regression model. We will compare all the models at the end when making the final decision.

Reduced Big Model via AIC:
big_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3 + poly(Age,3) + poly(Height,3), data = childfev)
big_model_aic = step(big_model, direction = "backward", trace = 0)
summary(big_model_aic)
## 
## Call:
## lm(formula = FEV ~ Age + Height + Sex + Smoker + poly(Height, 
##     3) + Age:Height + Age:Sex + Age:Smoker + Height:Sex + Height:Smoker + 
##     Sex:Smoker + Age:Height:Sex + Height:Sex:Smoker, data = childfev)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3035 -0.2190 -0.0021  0.2167  1.6067 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.2954728  2.5352999    0.12  0.90726    
## Age                      -0.0097005  0.1592017   -0.06  0.95143    
## Height                    0.0423046  0.0388440    1.09  0.27653    
## SexMale                  -3.4622592  3.0293083   -1.14  0.25350    
## SmokerNon                -3.5027204  1.9922327   -1.76  0.07919 .  
## poly(Height, 3)1                 NA         NA      NA       NA    
## poly(Height, 3)2          0.0748272  0.8601746    0.09  0.93071    
## poly(Height, 3)3         -0.9398997  0.4368687   -2.15  0.03182 *  
## Age:Height                0.0000545  0.0024431    0.02  0.98221    
## Age:SexMale              -0.5785139  0.1482446   -3.90  0.00011 ***
## Age:SmokerNon             0.0734381  0.0245113    3.00  0.00284 ** 
## Height:SexMale            0.0480355  0.0467074    1.03  0.30414    
## Height:SmokerNon          0.0408410  0.0304463    1.34  0.18026    
## SexMale:SmokerNon         7.1999317  2.6065228    2.76  0.00591 ** 
## Age:Height:SexMale        0.0093680  0.0023237    4.03 0.000062 ***
## Height:SexMale:SmokerNon -0.1077336  0.0396057   -2.72  0.00670 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.382 on 639 degrees of freedom
## Multiple R-squared:  0.81,   Adjusted R-squared:  0.806 
## F-statistic:  195 on 14 and 639 DF,  p-value: <2e-16
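For reference, step() with direction = "backward" removes, at each stage, the term whose deletion most decreases the AIC. For a Gaussian linear model with k estimated coefficients, the criterion is (up to an additive constant):

```latex
\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k
```

A smaller RSS lowers the first term while every retained coefficient costs 2, so the selection trades goodness of fit against model complexity.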

The AIC-selected model remains significant overall, with an F-test p-value below 2e-16 and an adjusted R-squared of 0.806. Several interaction terms are individually significant, though the Age and Height main effects are not (p-values of about 0.95 and 0.28). One coefficient, the linear term of poly(Height, 3), is again NA because it is collinear with the Height main effect.

Let’s check the Assumption of the model

Assumptions of the Model

Let's verify the assumptions of the model:
- Constant Variance
- Normality

Let's plot the Residuals-versus-Fitted plot and a Q-Q plot to see the distribution of the residuals.

par(mfrow = c(1,2))
plot(resid(big_model_aic)~fitted(big_model_aic), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted - Big Model (AIC)")
abline(h = 0, lwd = 2, col = "darkorange")

qqnorm(resid(big_model_aic), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(big_model_aic), lwd = 2, col = "darkorange")

From the above residuals-versus-fitted plot, though it looks better than that of the Multiple Linear Regression, the model is not doing a good job for the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant-variance assumption is suspect.

The Q-Q plot tells a similar story: both tails appear somewhat heavy, which suggests the normality assumption is also suspect for this model.

Let’s do the BP Test and Shapiro Test to gather some more evidence

bptest(big_model_aic)
## 
##  studentized Breusch-Pagan test
## 
## data:  big_model_aic
## BP = 94, df = 14, p-value = 6e-14

The BP test p-value is very low, which confirms that the constant-variance assumption is suspect

shapiro.test(resid(big_model_aic))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(big_model_aic)
## W = 0.99, p-value = 0.003

The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is likewise suspect

RSS, RMSE and RMSE LOOCV

Let's calculate the RSS, RMSE and RMSE LOOCV of the model.

RSS

sum((resid(big_model_aic))^2)
## [1] 93.29

RMSE

sqrt(mean(resid(big_model_aic)^2))
## [1] 0.3777

RMSE LOOCV

sqrt(mean((resid(big_model_aic)/(1-hatvalues(big_model_aic)))^2))
## [1] 0.3897
Metric Value
RSS 93.2936
RMSE 0.3777
RMSE LOOCV 0.3897

The RMSE LOOCV seems better than that of the Multiple Linear Regression model. We will compare all the models at the end when making the final decision.